-
-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crc32 clmul #64
Crc32 clmul #64
Conversation
Hello! Thanks for the PR, this is something we have wanted to implement since CLMUL was added for CRC64. 70% speed up over the generic method is a great speedup! If you are willing, can you do some additional benchmarks for us since you already have a framework setup? We are wondering what impact the compiler has, so can you show us differences between using GCC and Clang? This especially matters when it comes to the 3% speed up you mentioned for the inline asm. 3% isn't that significant, especially if its only for CRC32. It adds extra complexity to the code and makes it harder to maintain long-term, so we want to make sure it is worth it. Similarly, can you try making CRC_SIMD_BODY an inline function instead of a macro? This could make it easier to read/maintain. If it has a significant impact on performance then we should stick to a macro. So in summary, can you benchmark:
You don't need to make this change now, but before merging it would be great if you can clean up the commits:
Feel free to add fix up commits as we go through the review process but at the end we will need these changes. |
src/liblzma/check/crc_clmul_common.h
Outdated
@@ -1,7 +1,7 @@ | |||
/////////////////////////////////////////////////////////////////////////////// | |||
// | |||
/// \file crc_macros.h | |||
/// \brief Some macros for CRC32CLMUL and CRC64CLMUL | |||
/// \file crc_common.h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should match the name of the file (crc_clmul_common.h)
src/liblzma/check/crc_clmul_common.h
Outdated
# define CRC_CLMUL 1 | ||
|
||
/* | ||
// The generic code is much faster with 1-8-byte inputs and has |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test this with crc32? I would guess small inputs it have the same performance effect as crc64 but we might want to improve the comment if the impact is difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tested this I found that for both crc32 and crc64 they were faster than generic after 5 bytes.
src/liblzma/check/crc_clmul_common.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems worth combining this file and crc_macros.h into a single common crc header file. At this point crc_clmul_common.h does more than just clmul things since it also defines CRC_GENERIC.
I wouldn't put these things into check.h since they are specific for just crc32 and crc64. So maybe crc_common.h make sense after all.
src/liblzma/check/crc32_table.c
Outdated
|| (defined(__e2k__) && __iset__ >= 6) | ||
// No table needed but something has to be exported to keep some toolchains | ||
// happy. Also use a declaration to silence compiler warnings. | ||
extern const char lzma_crc32_dummy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just added a commit to master to avoid having to make a new symbol to avoid this warning. Please apply the same change here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I cannot find a reference right now but I have a feeling that some toolchain on some proprietary UNIX-like OS didn't like object files that don't export anything. The trick for the Windows build is fine because it's not a problem there with the GNU toolchain. That is, an empty translation unit in C and an object file that exports nothing are related but distinct issues. typedef works for avoiding empty translation units but not empty object files.
I might remember wrong and in that case I'm sorry for the noise. But if I'm wrong and the commit in the master is retained then its comment should be updated as no symbols are exported after that commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kept searching but I cannot find anything to confirm my memory. So I apologize for the noise. I updated the comment in the master.
I tested the difference that using GCC and Clang made in general and found that when using Clang instead of GCC there was negligible difference. The difference that using GCC and Clang made on the inline assembly was a 2% increase on GCC and 1% or less for Clang. Since this increase is not very significant I can get rid of the changes if you would like. Replacing CRC_SIMD_BODY with an inline function had no change to the runtime. Ill upload the Inline function as an extra commit, and squash it once you decide which one you like better. |
src/liblzma/check/crc32_fast.c
Outdated
/// | ||
/// crc32_clmul uses 32/64-bit x86 SSSE3, SSE4.1, and CLMUL instructions. | ||
/// It was derived from | ||
/// https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This page is no longer available, so this is a good time to update the link if we can. Is there another publicly available link where the PDF can be viewed? If not, we should remove the link altogether. The GitHub link works so we can at least update the (URLs were checked on ) part.
src/liblzma/check/crc32_table.c
Outdated
&& defined(__SSE4_1__) && defined(__PCLMUL__)) \ | ||
|| (defined(__e2k__) && __iset__ >= 6) | ||
// No table needed. Use a typedef to avoid an empty translation unit. | ||
typedef void lzma_crc64_dummy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be lzma_crc32_dummy
src/liblzma/check/crc32_table.c
Outdated
|
||
// FIXME: Compared to crc32_fast.c this has to check for __x86_64__ too | ||
// so that in 32-bit builds crc32_x86.S won't break due to a missing table. | ||
#if (defined(__x86_64__) && defined(__SSSE3__) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This #if needs to be:
#if !defined(HAVE_ENCODERS) && ((defined(__x86_64__) && defined(__SSSE3__) \ && defined(__SSE4_1__) && defined(__PCLMUL__)) \ || (defined(__e2k__) && __iset__ >= 6))
The !defined(HAVE_ENCODERS)
part is needed because the LZ encoder needs the lzma_crc32_table
. See lz_encoder_hash.h if you are interested.
#if (defined(__GNUC__) || defined(__clang__)) && !defined(__EDG__) | ||
__attribute__((__target__("ssse3,sse4.1,pclmul"))) | ||
#endif | ||
#if lzma_has_attribute(__no_sanitize_address__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are using crc_simd_body()
as an inline function, we probably do not need __attribute__((__no_sanitize_address__))
on crc32_clmul()
and crc64_clmul()
. Can you verify with both GCC and Clang? If you haven't used ASAN before, pass -fsanitize=address
in CFLAGS when configuring.
I have made all of the changes listed above. I am also planning to work on implementations for arm versions of crc32_clmul and crc64_clmul after this is finished. |
I'm sorry for the delay. Neither Jia or I have been able to look at this in the past days. :-( We are both happy to see CLMUL version of CRC32 and it's great if you plan to do ARM64 versions too. :-) The inline function version is definitely nicer when the speed is the same. So those changes should be squashed accordingly, thanks! For a moment I thought that keeping crc_macros.h as is and adding crc_clmul.h would be nicer but, as Jia has pointed out, crc_common.h defines CRC_GENERIC and such too so I guess it is better this way. Many small bits of code depend on each other in such ways that it seems impossible to make things look very pretty. In my experience it's nice if file renames are done as separate commits with only the mandatory edits. For example, the Small commits in are preferred whenever doing so makes sense. I wonder if it made sense to have crc_clmul.c with both CRC32 and CRC64 because then the binary wouldn't end up with two copies of is_clmul_supported() and crc_simd_body(). However, it's possible that crc_simd_body() has to be inlined if the function call overhead is too high for tiny input buffers. I feel it might be good to merge this after the inline function change has been squashed so that we have some good version committed in xz.git. So feel free to try the crc_clmul.c idea if you wish but it's not required for merging. Have you tested on 32-bit x86? If not, it's fine. :-) If yes: I haven't checked performance on 32-bit x86 in years and wonder if the assembly files still make sense compared to what GCC and Clang can do (for processors that don't support CLMUL). Those files were written in GCC 3.3/3.4 times. It shouldn't be hard to make 32-bit x86 autodetect between the assembly code and CLMUL so I can do it if it is worth it. Thanks! |
src/liblzma/check/crc_common.h
Outdated
/////////////////////////////////////////////////////////////////////////////// | ||
// | ||
/// \file crc_common.h | ||
/// \brief Some functions and macros for CRC32CLMUL and CRC64CLMUL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has macros for the generic versions too so "for CRC32 and CRC64" would be better.
I've started work on the changes. Don't worry about the delays, I appreciate that both of you are taking the time to look at this. I haven't tested on 32-bit x86 yet. |
crc64_fast.c was updated to use the code from crc_common.h instead.
I updated the PR with the squashing and comment change. I didn't try the crc_clmul.c idea but I believe it would result in cleaner code. I'll let you all handle it.
I hadn't tested the assembly version before so I gave it a try since it seemed interesting. I compiled my test program and liblzma with the -m32 GCC flag and ran the benchmark on my 64-bit machine. I don't have a 32-bit machine to test on. The results were somewhat surprising considering how old the assembly implementation is. I didn't have time to make a pretty graph again, but here is a quick summary of my findings:
The CRC64 CLMUL version became faster with buffers around 512 bytes. The runtime differences started to change between 32 - 1024 bytes so it was most interesting to categorize them as < 32 bytes and > 1024 bytes. So for CRC32 you are better off using the assembly version but CRC64 depends. |
Thanks for the updates!
Thanks for benchmarking the 32-bit version. I would have expected the CLMUL version to be much better than the assembly or the generic to be fairly close to the assembly. We'll take that into account when deciding how to proceed with 32-bit builds. |
I'm glad this has been merged. Thanks! A few thoughts about 32-bit x86: Testing on modern 64-bit processor running 32-bit code was what I had in mind, so that part is fine. The results look strange though. The 32-bit assembly code implements the same algorithm as the generic C code. The CLMUL version should easily be faster with 1024-byte buffers, even if those were unaligned. With GCC 3.3/3.4 I remember that GCC couldn't fit all hot variables in registers and the resulting extra stack access was bad for speed. On x86-64 this isn't a problem anymore. My expectation is that the 32-bit x86 assembly code for CRC32 should have similar speed as the generic C code has on x86-64. I don't have clear expectations of the speed of the C code on 32-bit x86. On 32-bit x86, CRC64 benefits more from CLMUL because 32-bit x86 doesn't have 64-bit general-purpose registers. With the generic method, including the assembly implementation, the 64-bit CRC value needs two registers and updating it needs more instructions. I wondered if eight xmm registers could be a limiting factor on 32-bit x86. However, on x86-64 exactly eight xmm registers are used by both CRC32 and CRC64 CLMUL implementations with GCC 13.2.1. So I suppose the number of xmm registers shouldn't be a problem. We (or likely it's mostly Jia) will do a few tests later. Thanks again! |
@hansjans162 We committed some changes to reorganize the CLMUL code. We refactored things so all the CLMUL specific code is in a new A few small changes were made to |
Thank you for the update, and I agree that it should not be much extra work to incorporate the changes into the code I have already written. |
While working on implementing arm support for crc_clmul I found that the processor I am using does not have support for PMULL. I am not going to continue work on this at the moment since the devices I have can't test my code, but I might continue later if I get hardware that supports this. |
Thank you for letting us know. If you are able to continue, please let us know too to ensure that no duplicate work will happen (unlikely but still). |
yall can you check this :3 uwu |
Sounds like a great idea! With the best of intentions, what could possibly go wrong.. |
Do you have reason to believe these commits are malicious? They're from a different author |
@qwertychouskie - Are they? Everything is open for discussion and it seems there is more and more unravelling each day/week |
Added an implementation for crc32 that makes use of clmul.
Code for this implementation was written by Ilya Kurdyukov and can be found here.
https://github.com/ilyakurdyukov/crc-clmul-sim
Also refactored crc64_clmul to use the new macros created for crc32_clmul.
As well as moved similar functions to crc_clmul_common to eliminate duplicate code.
I tested this on files doubling in size starting from 1 byte up to 1 Gigabyte.
During testing I found that crc32_clmul can run up to 70% faster than crc32_generic,
and has an average speed increase of 58.4% for sizes greater than 16 bytes.
I also used this to test the reworked version of crc64_clmul.
This version of crc64_clmul is an average of 3.9% faster than the original implementation.
This speed increase is due to some inline assembly as well as changing around the order of some if statements.
Pull request checklist
Please check if your PR fulfills the following requirements:
Pull request type
Please check the type of change your PR introduces:
What is the current behavior?
crc32_fast currently only uses a generic implementation.
What is the new behavior?
I added an implementation for crc32 that makes use of clmul in crc32_fast.c
Also refactored crc64_clmul implementation to use the same macros as crc32_clmul
Does this introduce a breaking change?
Other information
Here is the output from both tests that gave me the statistics above.
the number of unique files tested on and the number of times the crc is run
decrease as the bytes get larger so the benchmark does not take too long.
The 64 and 32 spd+ show the percentage speed increase over the generic.
The percentage on the graph show the combined average for all types.
For example (50% is twice as fast, 200% is twice as slow)
The 64old and 64new spd+ show the percentage speed increase over the generic.
The percentage on the graph show the combined average for all types.